Aug 2019

Problem Statement

Attrition:

  • A problem that affects all businesses
  • Leads to significant costs for a business,
    including business disruption and the cost of hiring and training new staff


  • Understanding the drivers is crucial
  • Classification models to predict if an employee is likely to quit could greatly increase HR’s ability to intervene on time and remedy the situation to prevent attrition.

Dataset

  • \(n = 1470\) employees and some of their attributes (selection below):
Name                        Description
AGE                         Numerical value
GENDER                      (1=Female, 2=Male)
EDUCATION                   Numerical value
BUSINESS TRAVEL             (1=No Travel, 2=Travel Frequently, 3=Travel Rarely)
DISTANCE FROM HOME          Numerical value - the distance from work to home
JOB SATISFACTION            Numerical value - satisfaction with the job
MONTHLY INCOME              Numerical value - monthly salary
NUMCOMPANIES WORKED         Numerical value - no. of companies worked at
OVERTIME                    (1=No, 2=Yes)
PERCENT SALARY HIKE         Numerical value - percentage increase in salary
PERFORMANCE RATING          Numerical value - performance rating
TOTAL WORKING YEARS         Numerical value - total years worked
TRAINING TIMES LAST YEAR    Numerical value - hours spent training
WORK LIFE BALANCE           Numerical value - time spent between work and outside
YEARS AT COMPANY            Numerical value - total number of years at the company
YEARS SINCE LAST PROMOTION  Numerical value - last promotion
ATTRITION                   Employee leaving the company (0=no, 1=yes)
  • Target variable Attrition: Whether or not an employee has quit

Data Exploration

  • Target variable Attrition is imbalanced:
table(dat_raw[["Attrition"]]) %>% prop.table()
       No       Yes 
0.8387755 0.1612245 
  • No missing values in the data
  • Some features with no variation:
    • EmployeeCount: constant, always 1
    • Over18: constant, always Y
    • StandardHours: constant at 80
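Constant columns like these can be flagged in one line of base R. A minimal sketch on a toy data frame that mimics the constants named above (the real check against the full data is not shown on the slides):

```r
## Sketch: flag zero-variance features (toy data frame mimicking the constants above)
dat_toy <- data.frame(
  Age           = c(41, 49, 37),
  EmployeeCount = c(1, 1, 1),
  Over18        = c("Y", "Y", "Y"),
  StandardHours = c(80, 80, 80)
)
is_constant   <- vapply(dat_toy, function(x) length(unique(x)) == 1, logical(1))
constant_cols <- names(dat_toy)[is_constant]
constant_cols  # "EmployeeCount" "Over18" "StandardHours"
```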

Data Exploration

  • Feature correlations:
cormat_long %>% 
  filter(Var1 != Var2, abs(value) > .8) %>% 
  arrange(value)
           Var1              Var2      value
1 MaritalStatus  StockOptionLevel -0.8131347
2    Department           JobRole  0.8508444
3      JobLevel TotalWorkingYears  0.8523883
4      JobLevel     MonthlyIncome  0.9675631
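The `cormat_long` object above is a correlation matrix in long format. A base-R sketch of how such an object can be built and filtered (illustrated on `mtcars`, since the construction is not shown on the slides):

```r
## Sketch: build a long-format correlation table like cormat_long (on mtcars)
cormat      <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
cormat_long <- as.data.frame(as.table(cormat))
names(cormat_long) <- c("Var1", "Var2", "value")
## Same filter as on the slide, in base R:
high_cors <- subset(cormat_long, Var1 != Var2 & abs(value) > .8)
```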

Heterogeneous correlation matrix:

  • Pearson correlations,
  • Polyserial correlations, and
  • Polychoric correlations
cormat_hetcor <- polycor::hetcor(
  as.data.frame(dat_all), 
  std.err = FALSE, use = "pairwise")

Features used in the model

  • Features used in the model:
varnames_features
 [1] "Age"                      "BusinessTravel"           "DailyRate"               
 [4] "DistanceFromHome"         "Education"                "EducationField"          
 [7] "EnvironmentSatisfaction"  "Gender"                   "HourlyRate"              
[10] "JobInvolvement"           "JobRole"                  "JobSatisfaction"         
[13] "MaritalStatus"            "MonthlyIncome"            "MonthlyRate"             
[16] "NumCompaniesWorked"       "OverTime"                 "PercentSalaryHike"       
[19] "PerformanceRating"        "RelationshipSatisfaction" "StockOptionLevel"        
[22] "TotalWorkingYears"        "TrainingTimesLastYear"    "WorkLifeBalance"         
[25] "YearsAtCompany"           "YearsInCurrentRole"       "YearsSinceLastPromotion" 
[28] "YearsWithCurrManager"    
  • Features excluded:
    • Features with no relevant information: EmployeeNumber
    • Constant features: EmployeeCount, Over18, StandardHours
    • Highly correlated features: JobLevel, Department

Categorical variables

  • Variable description did not provide scale level
  • Hence, some features were assumed to be categorical and dummy-coded:
varnames_convert_to_cat
[1] "WorkLifeBalance"          "StockOptionLevel"         "RelationshipSatisfaction"
[4] "JobSatisfaction"          "JobLevel"                 "JobInvolvement"          
[7] "EnvironmentSatisfaction"  "Education" 
  • All of them are in a range of [1, 4] or [1, 5]
  • No information about how the information was collected
  • Might be a Likert-Scale (ordinal or even interval scale)
  • But might also be totally unrelated options (nominal scale)
  • Assumed to be nominal scale, just to be on the safe side
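The dummy coding itself can be sketched with base R's `model.matrix` (toy data; the actual pipeline used for the slides is not shown):

```r
## Sketch: dummy-code an assumed-nominal feature with model.matrix
toy     <- data.frame(JobSatisfaction = factor(c(1, 2, 3, 4, 2)))
dummies <- model.matrix(~ JobSatisfaction, toy)[, -1]  # drop the intercept column
## One indicator column per non-reference level (here: levels 2, 3, 4)
```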

Assumptions: Summary

  • Data on individual level and not aggregated
    (even though there is a variable EmployeeCount)
  • Some variables were assumed to be nominal scale,
    even though they might be on an ordinal scale

  • In general, data was assumed to be “okay”,
    even though some variables would warrant some questions, e.g.:
    • DailyRate: seemingly arbitrary numbers
      • its near-zero correlation with MonthlyIncome (\(r = .008\)) seemed strange
    • MonthlyRate: also seemingly arbitrary numbers
      • not associated with any other variable

Train-/Eval-/Test-Split

  • Data was split into 3 parts using random sampling:
    • Training set: \(80\%\), \(n = 1180\)
    • Evaluation set: \(10\%\), \(n = 135\)
    • Test set: \(10\%\), \(n = 155\)
  • Distribution of target variable in each part remained essentially unchanged:
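The split can be sketched as simple random sampling of row indices, using the group sizes from the slide (the exact code and seed used originally are not shown):

```r
## Sketch: 80/10/10 random split into train/eval/test indices
set.seed(1)
n         <- 1470
idx       <- sample(n)          # random permutation of row indices
idx_train <- idx[1:1180]
idx_eval  <- idx[1181:1315]
idx_test  <- idx[1316:1470]
```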

Upsampling the Training Set

  • To balance out the classes, SMOTE sampling was used for the training set
    • Synthetic Minority Oversampling Technique
    • Undersamples the majority class
    • Creates synthetic examples of the minority class
    • by interpolating between each minority case and one of its \(k\) nearest neighbours (\(k = 5\) in this case)
  • Validation and test set remain untouched
set.seed(9560)
formula_smote <- paste0(varnames_target, " ~ .") %>% 
  as.formula()
dat_model <- SMOTE(formula_smote,   ##  Attrition ~ .
                   data  = dat_model_imbal %>% as.data.frame())    
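The interpolation step at the heart of SMOTE can be sketched in a few lines of base R (toy feature values; the actual implementation works on the k = 5 nearest neighbours of each minority case):

```r
## Toy sketch of SMOTE's synthetic-example step
set.seed(1)
x      <- c(Age = 30, MonthlyIncome = 3000)  # a minority-class observation
x_nn   <- c(Age = 34, MonthlyIncome = 3600)  # one of its nearest neighbours
lambda <- runif(1)                           # random point on the segment
x_syn  <- x + lambda * (x_nn - x)            # synthetic minority example
```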

Machine Learning Models

  • A number of models were fitted to the training data:
    • Logistic regression (R’s base package)
    • Elastic net regression (glmnet package)
    • Random forest (ranger package)
    • Boosted GLMs (glmboost from the model-based boosting package mboost)
    • XGBoost (xgboost package)
    • AdaBoost (ada package)
    • Neural net with 1 hidden layer (nnet package)
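As a sketch of the simplest of these models, a logistic-regression baseline can be fitted with base R's `glm` (toy data shaped like the task; variable names from the slides, values invented here):

```r
## Sketch: logistic-regression baseline on toy attrition-like data
set.seed(1)
toy <- data.frame(
  Attrition = factor(sample(c("No", "Yes"), 200, replace = TRUE, prob = c(.84, .16))),
  OverTime  = factor(sample(c("No", "Yes"), 200, replace = TRUE)),
  Age       = round(rnorm(200, mean = 37, sd = 9))
)
fit <- glm(Attrition ~ ., data = toy, family = binomial())
p   <- predict(fit, type = "response")  # P(Attrition == "Yes")
```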

Data Preparation and Model Fitting

  • Data preparation: nominal variables were manually dummy-coded for model tuning and (initial) model fitting (necessary for applying XGBoost, unfortunately)
  • All model tuning and fitting was performed using the mlr package

  • Parameter tuning: 50 iterations of random search with 6-fold CV within the training set
  • Main performance measure: Matthews correlation coefficient (MCC)
    • essentially the correlation between true and predicted labels
    • suitable for imbalanced samples
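As a reminder of what MCC computes, a minimal implementation from the four confusion-matrix counts (a sketch; the analysis itself used mlr's built-in `mcc` measure):

```r
## Sketch: Matthews correlation coefficient from confusion-matrix counts
mcc_manual <- function(tp, tn, fp, fn) {
  num <- tp * tn - fp * fn
  den <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (den == 0) return(0)   # convention: MCC = 0 if any margin is empty
  num / den
}
mcc_manual(9, 16, 0, 0)  # perfect predictions give MCC = 1
```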
## set random seed, also valid for parallel execution:
set.seed(4271, "L'Ecuyer")

## choose resampling strategy for parameter tuning:
rdesc <- makeResampleDesc(predict = "both", 
                          method = "CV", iters = 6)

## parameters for parameter tuning:
n_maxit <- 50
ctrl <- makeTuneControlRandom(maxit = n_maxit)  
tune_measures <- list(mcc, auc, f1, bac, acc, mmce, timetrain, timepredict)

Model Fitting and Evaluation

  • For evaluation of performance stability and over-/underfitting
    • Models were fitted with 3x repeated 5-fold cross-validation
    • Within the training set
    • Using tuned parameters
  • For evaluation of model performance
    • Models were re-fitted on the complete training set
    • And performance was evaluated on the evaluation set
  • Subset of models was re-fitted on complete training set
    • using non-dummy-coded data
    • for easier model interpretation
  • Final evaluation of performance on the test set

Results

Model Performance Stability

  • Large spread in general (~\(0.15 \Delta \mbox{MCC}\))
  • Spread similar for most models
  • Higher spread for neural network
  • Performance can’t be judged within the training set because of SMOTE-sampling
  • Severe overfitting for some models (ranger, XGBoost, AdaBoost, neural net)
  • Least overfitting for elastic net regression
bmr_train_summary %>% select(matches("learner|mcc"))
        learner.id mcc_train.train.mean mcc.test.mean
1   classif.logreg            0.6108784     0.5515077
2   classif.glmnet            0.5985743     0.5406190
3   classif.ranger            0.9632905     0.7085456
4 classif.glmboost            0.6033816     0.5381702
5  classif.xgboost            1.0000000     0.7962943
6      classif.ada            0.9814215     0.7535416
7     classif.nnet            0.8474984     0.5990635

Performance in Evaluation Set

  • Boosted GLMs have the highest performance in the evaluation set (MCC)
  • Followed by elastic net regression and logistic regression

  • XGBoost, AdaBoost and ranger: worse performance


Performance in Evaluation Set (cont’d)

  • Other performance measures (AUC, accuracy): glmboost, elastic net and logistic regression are top contenders
  • Except balanced accuracy, where XGBoost does slightly better, but is worse on all other measures
  • Elastic net showed the least overfitting (earlier slides)
dat_perf_eval_rnd
     model     mcc     auc     bac     acc
1   logreg   0.560   0.870   0.726   0.881
2   glmnet   0.593   0.873   0.700   0.889
3   ranger   0.446   0.801   0.635   0.859
4 glmboost   0.625   0.872   0.720   0.896
5  xgboost   0.542   0.785   0.737   0.874
6      ada   0.465   0.847   0.708   0.852
7     nnet   0.443   0.813   0.715   0.837

Performance in Test Set

  • Elastic net does best (also showed least overfitting before)
  • Most others, especially gradient boosting methods, drastically overestimated performance in the evaluation set
    (see next slide)
  • Drop by about \(.1 - .2\) in MCC

dat_perf_test_rnd
     model     mcc     auc     bac     acc
1   logreg   0.468   0.830   0.659   0.884
2   glmnet   0.545   0.832   0.667   0.897
3   ranger   0.422   0.798   0.621   0.877
4 glmboost   0.464   0.822   0.642   0.884
5  xgboost   0.363   0.740   0.644   0.858
6      ada   0.425   0.789   0.668   0.871
7     nnet   0.373   0.771   0.683   0.839
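The drop from evaluation to test set can be read off the two result tables directly (MCC values copied from the tables above):

```r
## MCC drop from evaluation to test set (values from the two tables above)
mcc_eval <- c(logreg = 0.560, glmnet = 0.593, ranger = 0.446, glmboost = 0.625,
              xgboost = 0.542, ada = 0.465, nnet = 0.443)
mcc_test <- c(logreg = 0.468, glmnet = 0.545, ranger = 0.422, glmboost = 0.464,
              xgboost = 0.363, ada = 0.425, nnet = 0.373)
mcc_drop <- mcc_eval - mcc_test
sort(mcc_drop, decreasing = TRUE)  # boosting methods show the largest drops
```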

Performance Generalization Eval/Test

Most important Features

Most important features:

  • Working overtime
  • Job role
  • Environment satisfaction
  • Total working years
  • Number of companies worked for
  • Business Travel

Measured by:

  • Increase in classification error (CE) when shuffling the feature
  • Used 50 repetitions to increase the stability of the estimation
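This importance measure can be sketched as a simple permutation loop (toy predictor invented here; the slides used 50 repetitions per feature):

```r
## Sketch: permutation importance as mean increase in classification error (CE)
perm_importance <- function(pred_fun, dat, y, feature, n_rep = 50) {
  base_ce <- mean(pred_fun(dat) != y)
  ce_perm <- replicate(n_rep, {
    d <- dat
    d[[feature]] <- sample(d[[feature]])  # shuffle one feature
    mean(pred_fun(d) != y)
  })
  mean(ce_perm) - base_ce                 # increase in CE
}

## Toy check: a predictor that relies entirely on the shuffled feature
set.seed(1)
dat <- data.frame(OverTime = rep(c("Yes", "No"), each = 25))
y   <- ifelse(dat$OverTime == "Yes", "Yes", "No")
imp <- perm_importance(function(d) ifelse(d$OverTime == "Yes", "Yes", "No"),
                       dat, y, "OverTime")
```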

Most important Features

Comparing variable importance of 3 top models (logreg, glmnet, glmboost):

  • In top-5 features, 3 overlap:
    • OverTime
    • JobRole
    • EnvironmentSatisfaction
  • In top-8 features, 4 overlap:
    • BusinessTravel (in addition to top-5 overlapping features)
  • Low agreement between models, so the findings are not very robust
  • even though all models are linear

Feature Effects: Overtime

  • ICE plot: Individual Conditional Expectation
    • Predicted probability as the feature is set to each of its values (all others untouched)
    • For all observations individually
      (dots or thin lines)
    • And summary measure (median or mean)
    • For categorical variables: probability for each outcome
      (Focus on right plot: Probability for Attrition == Yes)
  • Higher probability for attrition when working overtime
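For a binary feature like OverTime, the ICE curves reduce to two predictions per observation, which can be sketched as follows (toy predictor invented here; the slides' plots were produced differently):

```r
## Sketch: ICE-style predictions for a binary feature
ice_binary <- function(pred_fun, dat, feature, levels = c("No", "Yes")) {
  d_no  <- dat; d_no[[feature]]  <- levels[1]
  d_yes <- dat; d_yes[[feature]] <- levels[2]
  data.frame(p_no = pred_fun(d_no), p_yes = pred_fun(d_yes))
}

## Toy predictor: overtime adds 0.2 to the attrition probability
toy  <- data.frame(OverTime = c("Yes", "No", "No"), Age = c(30, 45, 28))
pfun <- function(d) 0.1 + 0.2 * (d$OverTime == "Yes")
ice  <- ice_binary(pfun, toy, "OverTime")
```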

Feature Effects: Job Role

  • Highest risk for Sales Representatives, Lab Technicians and HR

Feature Effects: Environment Satisfaction

  • Higher satisfaction is associated with lower probability of attrition
  • Might actually be on a continuous scale, but was assumed to be categorical

Feature Effects: Business Travel

  • The more travel, the higher the attrition risk

Feature Effects: Total Working Years

  • More working years are associated with lower probability for attrition
  • Correlated with age (similar effect)
  • Not all top-3 models agree that this effect is among the important ones
    (but they do agree on the direction of the effect)

Feature Effects: Number of Companies worked

  • The more jobs someone held, the higher the probability for attrition
  • Not all top-3 models agree that this effect is among the important ones
    (but they do agree on the direction of the effect)

Feature Effects: Work Life Balance

  • Was treated as categorical variable: Lowest risk in category 3
  • In case that this is assessed via a Likert scale, that might be of interest
  • Assuming high values are “good work life balance”, this seems to indicate that work life balance can be too good…

Discussion

  • Results not very surprising: higher probability for attrition associated with
    • working overtime
    • travelling
    • low environment satisfaction
  • Similar, but with lower importance and stability:
    • fewer working years
    • more jobs held
  • Work-life-balance effects somewhat interesting (if effect can be trusted)

Discussion (cont’d)

  • Model performance only mediocre at best
  • SMOTE sampling might not have helped, in hindsight
  • Effects of features are not very strong
  • Other features might be more valuable:
    • Management style
    • Flexible working time
    • Amount and quality of team work
    • Feedback and recognition, etc.

Summary

  • Business stated that “Classification models to predict if an employee is likely to quit could greatly increase HR’s ability to intervene on time and remedy the situation to prevent attrition”
  • Business goal definitely not met with this analysis / dataset:
    can only serve to help understand drivers of attrition after the fact
  • For prediction of future attrition
    • time-related features (timestamps, changes in satisfaction, etc.) and
    • a holdout window would be needed
      (how far into the future to predict? how much time is needed to act?)
    • (as well as better model quality)

Thank you.